Homework 9.1

Joe, Krystin, Rohan

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot datashader bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
try:
    import multiprocess
except ImportError:
    import multiprocessing as multiprocess

import warnings
    
import numpy as np
import pandas as pd

import bebi103
import iqplot
import scipy
import scipy.stats as st
import holoviews as hv
import holoviews.operation.datashader
hv.extension('bokeh')

import bokeh
bokeh.io.output_notebook()
bebi103.hv.set_defaults()

import numpy.random
rg = numpy.random.default_rng()
import random
import numba
import panel as pn
from scipy.stats import gamma

a) Exploratory Data Analyses

Attributions: Joe

First we load the data set and make it tidy.

In [2]:
# Download Data and wrangle into tidy format
df = pd.read_csv(os.path.join(data_path, "gardner_mt_catastrophe_only_tubulin.csv"), header=9)
df12 = pd.DataFrame(df['12 uM'])
df12.columns = ['time']
df7 = pd.DataFrame(df['7 uM'])
df9 = pd.DataFrame(df['9 uM'])
df10 = pd.DataFrame(df['10 uM'])
df14 = pd.DataFrame(df['14 uM'])
con = [12,7,9,10,14]
frames = [df12, df7, df9, df10, df14]

for i, frame in enumerate(frames):
    frame.columns = ['time']
    frame["concentration"] = con[i]
df = pd.concat(frames)
df = df.dropna()
df = df.sort_values('concentration')
df
Out[2]:
time concentration
267 230.0 7
23 75.0 7
24 80.0 7
25 80.0 7
26 80.0 7
... ... ...
47 325.0 14
48 325.0 14
49 330.0 14
43 320.0 14
140 1420.0 14

1920 rows × 2 columns

Now that the data is tidy, we can create some visualizations to better understand how microtubule catastrophe times behave at different tubulin concentrations. We start by making a stripbox plot and an ECDF.

In [3]:
p = iqplot.stripbox(
    data=df,
    x_axis_label = 'Time (s)',
    y_axis_label = 'Concentration (uM)',
    width=500,
    height=500,
    q='time',
    cats = 'concentration',
    jitter=True,
    top_level = "box",
    whisker_kwargs=dict(line_color='#00000f', line_width=.8),
    box_kwargs=dict(line_color='#00000f', line_width=.8),
    median_kwargs=dict(line_color='#00000f', line_width=.8),
    marker_kwargs=dict(alpha=0.3),
)

p1 = iqplot.ecdf(
    data=df,
    q='time',
    cats=['concentration'],
    kind='colored',
    conf_int=True,
)

bokeh.io.show(p)
bokeh.io.show(p1)

Above we have plotted a stripbox plot and the ECDF of the distribution of catastrophe times, separated by total tubulin concentration. From these plots, we can see that the concentration of tubulin does not appear to strongly affect the time to catastrophe. This is best seen in the ECDF plot, where the distributions of catastrophe times for the different concentrations are nearly identical.
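As a tabular complement to the plots, per-concentration summary statistics can be computed directly from the tidy frame. This is a minimal sketch; the small frame below is a synthetic stand-in for `df`, using a few values from the table above.

```python
import pandas as pd

# Toy stand-in for the tidy catastrophe-time frame built above
toy = pd.DataFrame({
    "time": [75.0, 80.0, 230.0, 320.0, 325.0, 330.0],
    "concentration": [7, 7, 7, 14, 14, 14],
})

# Median catastrophe time and sample size per concentration; similar medians
# across concentrations would support the "no strong effect" reading of the ECDFs
summary = toy.groupby("concentration")["time"].agg(["median", "count"])
print(summary)
```

With the real `df`, the same `groupby` call gives one row per concentration (7, 9, 10, 12, 14 uM).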

b) Choosing a preferred Model

Now, we will use our work from HW 8.2 to create two different potential generative models for the distribution of catastrophe times for microtubules at a 12 uM tubulin concentration.

Gamma Distribution
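For reference, under the gamma model the catastrophe time $t$ has probability density

```latex
f(t; \alpha, \beta) = \frac{\beta^{\alpha}\, t^{\alpha - 1}\, e^{-\beta t}}{\Gamma(\alpha)},
```

where $\alpha$ is the number of Poisson arrivals needed for catastrophe and $\beta$ is the arrival rate. SciPy parametrizes the gamma distribution by shape and scale rather than shape and rate, which is why the code below uses `scale=1/beta`.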

In [16]:
def log_like_iid_gamma(params, n):
    """Log likelihood for i.i.d. Gamma measurements, parametrized
    by alpha, beta."""
    alpha, beta = params

    if alpha <= 0 or beta <= 0:
        return -np.inf

    return np.sum(st.gamma.logpdf(n, alpha, loc = 0, scale = 1/beta))


def mle_iid_gamma(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    gamma measurements, parametrized by alpha, beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_gamma(params, n),
            x0=np.array([1, 1/2]),
            args=(n,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
        
def gen_fun_gamma(params, n, size, rg):
    alpha, beta = params
    return st.gamma.rvs(alpha, loc = 0, scale = 1/beta, size = size)

        
data = df.loc[df['concentration'] == 12, 'time']

gamma_mle = mle_iid_gamma(data)


bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_gamma,
    gen_fun_gamma,
    data,
    gen_args=(data, ),
    size=1000,
    n_jobs=3,
    progress_bar=True,
)

conf = np.percentile(bs_reps, [2.5, 97.5], axis=0)

alpha_g = gamma_mle[0]
beta_g = gamma_mle[1]

print('''alpha
      MLE: {}
      95% Confidence Interval {}
          '''.format(gamma_mle[0], conf[:, 0]))
print('''beta
      MLE: {}
      95% Confidence Interval {}
          '''.format(gamma_mle[1], conf[:, 1]))

Arrival of two Poisson processes (with different rates)

In [5]:
def log_like_iid_exp(params, t):
    """Log likelihood for i.i.d. two-successive-Poisson-arrival
    measurements, parametrized by beta1, beta2."""
    beta_1, beta_2 = params
    dB = beta_2 - beta_1

    if beta_1 <= 0 or beta_2 <= 0:
        return -np.inf
    if not (dB >= 0):
        return -np.inf
    # If beta 1 ~ beta 2, we use a different pdf with just beta_1
    if np.abs(beta_1 - beta_2) < .0001:
        pdf = (beta_1**2) * t * np.exp(-beta_1*t)
    
    else: 
        c = (beta_1 * (beta_1 + dB))/dB
        pdf = c * np.exp(-beta_1*t)*(1-np.exp(-dB*t))

    return np.sum(np.log(pdf))


def mle_iid_exp(times):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    two-successive-Poisson-arrival measurements, parametrized by beta1, beta2"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")

        res = scipy.optimize.minimize(
            fun=lambda params, times: -log_like_iid_exp(params, times),
            x0=np.array([.01, .011]),
            args=(times,),
            method='Powell'
        )

    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
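As a quick sanity check on the likelihood above (not part of the original notebook), we can verify numerically that the two-Poisson-arrival PDF is properly normalized; the rate values here are illustrative.

```python
import numpy as np
import scipy.integrate

def two_poisson_pdf(t, beta_1, beta_2):
    # Same expression as in log_like_iid_exp for the beta_1 != beta_2 case
    dB = beta_2 - beta_1
    c = beta_1 * (beta_1 + dB) / dB
    return c * np.exp(-beta_1 * t) * (1 - np.exp(-dB * t))

# A proper PDF must integrate to one over [0, inf)
total, err = scipy.integrate.quad(two_poisson_pdf, 0, np.inf, args=(0.005, 0.006))
print(total)
```

Analytically, the integral is $c\,(1/\beta_1 - 1/\beta_2) = 1$, and the numerical result agrees.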
In [6]:
mle_exp = mle_iid_exp(np.array(data))

def gen_fun_exp(params, n, size, rg):
    beta_1, beta_2 = params
    b_1 = rg.exponential(1/beta_1, size)
    b_2 = rg.exponential(1/beta_2, size)
    return b_1 + b_2

bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_exp,
    gen_fun_exp,
    data,
    gen_args=(data, ),
    size=2000,
    n_jobs=3,
    progress_bar=True,
)

conf_exp = np.percentile(bs_reps, [2.5, 97.5], axis=0)

beta1_exp = mle_exp[0]
beta2_exp = mle_exp[1]

print('''beta_1
      MLE: {}
      95% Confidence Interval {}
          '''.format(mle_exp[0], conf_exp[0,:]))
print('''beta_2
      MLE: {}
      95% Confidence Interval {}
          '''.format(mle_exp[1], conf_exp[1, :]))
100%|██████████| 668/668 [00:04<00:00, 135.39it/s]

100%|██████████| 666/666 [00:04<00:00, 134.06it/s]
beta_1
      MLE: 0.005034534882527375
      95% Confidence Interval [0.00366362 0.0057279 ]
          
beta_2
      MLE: 0.005366033634120073
      95% Confidence Interval [0.00466679 0.00929894]
          

Attributions below: Krystin

Use predictive ECDFs to compare models

To compare the two models, we use predictive ECDFs to assess whether the MLE parameters yield a good representation of the true generative process, by examining how data generated from each model look. We used parametric bootstrapping to generate data sets and then plotted confidence intervals of the resulting ECDFs, with the measured ECDF overlaid, using predictive_ecdf from the bebi103 package; the plots are shown below. We also plotted the difference between the predictive ECDF and the measured ECDF, which makes it easier to see which model fits best.
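Roughly, the band that `predictive_ecdf` draws can be sketched as follows. This is a simplified stand-in for the bebi103 implementation, with illustrative gamma parameters close to the MLEs: evaluate each bootstrap sample's ECDF on a common grid, then take percentiles across samples at each grid point.

```python
import numpy as np

rng = np.random.default_rng(3252)
n = 150

# Parametric bootstrap draws from a gamma model (illustrative parameters)
samples = rng.gamma(2.9, 1 / 0.0077, size=(1000, n))

# Evaluate each sample's ECDF on a common time grid
t_grid = np.linspace(0, 1500, 200)
ecdfs = np.array(
    [np.searchsorted(np.sort(s), t_grid, side="right") / n for s in samples]
)

# Middle 95% envelope across bootstrap ECDFs (the lighter band in the plots)
low, high = np.percentile(ecdfs, [2.5, 97.5], axis=0)
print(low.shape, bool(np.all(low <= high)))
```

The middle 68% band (the darker region) comes from the same computation with percentiles [16, 84].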

In [9]:
times = data
rg = np.random.default_rng()

gamma_samples = np.array(
    [rg.gamma(alpha_g, 1 / beta_g, size=len(data)) for _ in range(1000)]
)

exp_samples = np.array(
    [rg.exponential(1/beta1_exp, size=len(data)) + rg.exponential(1/beta2_exp, size=len(data)) for _ in range(1000)]
)
In [10]:
# Gamma

p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, discrete=True, x_axis_label="time (s)"
)

bokeh.io.show(p1)
In [11]:
# Gamma

p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, diff='ecdf', discrete=True, x_axis_label="time (s)"
)

bokeh.io.show(p1)

For the figures above, the lighter blue region contains the middle 95% of the generated ECDFs, while the darker blue captures the middle 68%. From these plots, we can see that most of the measured ECDF lies within the middle 95% of ECDFs generated from the Gamma model.

In [12]:
# Exponential

p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, discrete=True, x_axis_label="time (s)", color="red",
)

bokeh.io.show(p2)
In [13]:
# Exponential

p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, diff='ecdf', discrete=True, x_axis_label="time (s)", color="red",
)

bokeh.io.show(p2)

For the figures above, it is more apparent that the two-successive-Poisson-processes generative model does not fit the data as well, since many of the points lie outside the 95% confidence region.

In [14]:
p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, discrete=True, x_axis_label="time (s)"
)

p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, discrete=True, x_axis_label="time (s)", p=p1, color="red",
)

bokeh.io.show(p1)

The above plot shows the predictive CDFs of the gamma distribution (yellow) and the two-successive-Poisson-arrivals model (red), along with the ECDF of the catastrophe times for microtubules at a 12 uM concentration of tubulin. The gamma distribution more closely resembles the distribution of catastrophe times than the successive-Poisson model does. Therefore, we will use the gamma distribution to model microtubule catastrophe times moving forward.
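A quantitative complement to the visual comparison (not performed in the original notebook) is the Kolmogorov-Smirnov distance between the data and each fitted model's CDF. The sketch below computes it for the gamma model; synthetic gamma-distributed times stand in for the 12 uM data, and with the real data `t` would be the measured times. The same check could be done for the successive-Poisson model by coding its CDF explicitly.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)
t = rng.gamma(2.9, 1 / 0.0077, size=200)  # stand-in for the 12 uM times

# Fit the gamma model with location fixed at zero, as in the MLE above
alpha_hat, _, scale_hat = st.gamma.fit(t, floc=0)

# KS statistic: maximum vertical distance between ECDF and model CDF
ks = st.kstest(t, "gamma", args=(alpha_hat, 0, scale_hat))
print(ks.statistic)
```

A small statistic means the fitted CDF tracks the ECDF closely, matching what the predictive ECDF plots show for the gamma model.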

Code: Joe

c) Obtain parameter estimates (for alpha and beta) for the other tubulin concentrations.

In [15]:
mles = []
for i, concentration in enumerate([7, 9, 10, 12, 14]):
    data = pd.DataFrame(df.loc[df['concentration'] == concentration, 'time'])

    gamma_mle = mle_iid_gamma(data)

    bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
        mle_iid_gamma,
        gen_fun_gamma,
        data,
        gen_args=(data, ),
        size=1000,
        n_jobs=3,
        progress_bar=False,
    )

    conf = np.percentile(bs_reps, [2.5, 97.5], axis=0)
    
    mles.append(gamma_mle)
    print('Concentration : {} uM'.format(concentration))
    print('''alpha
          MLE: {}
          95% Confidence Interval {}
              '''.format(gamma_mle[0], conf[:, 0]))
    print('''beta
          MLE: {}
          95% Confidence Interval {}
          
              '''.format(gamma_mle[1], conf[:, 1]))
Concentration : 7 uM
alpha
          MLE: 2.4435619789520313
          95% Confidence Interval [2.21886514 2.73651842]
              
beta
          MLE: 0.007550753626328544
          95% Confidence Interval [0.0068392  0.00871728]
          
              
Concentration : 9 uM
alpha
          MLE: 2.6796794163921733
          95% Confidence Interval [2.3117431  3.17034229]
              
beta
          MLE: 0.008779451230912894
          95% Confidence Interval [0.00751938 0.01064915]
          
              
Concentration : 10 uM
alpha
          MLE: 3.2107888557992066
          95% Confidence Interval [2.73261667 3.80284501]
              
beta
          MLE: 0.009029894508801357
          95% Confidence Interval [0.00764746 0.01106814]
          
              
Concentration : 12 uM
alpha
          MLE: 2.9152792725986476
          95% Confidence Interval [2.66856862 3.24304299]
              
beta
          MLE: 0.007660623061282436
          95% Confidence Interval [0.00694307 0.00858692]
          
              
Concentration : 14 uM
alpha
          MLE: 3.3613691846967186
          95% Confidence Interval [2.75413646 4.24449641]
              
beta
          MLE: 0.007175192148528065
          95% Confidence Interval [0.00573597 0.0092493 ]
          
              

Krystin:

Looking at the values of the parameters versus tubulin concentration, we can see that $\alpha$, the number of Poisson arrivals required for microtubule catastrophe, generally increases, with the exception of 12 uM, where the calculated $\alpha$ and its confidence interval are smaller than those for 10 uM. Meanwhile, $\beta$, the rate of the underlying arrival process, is largest at 10 uM and decreases toward the tails at 7 uM and 14 uM. It is difficult to say conclusively which tubulin concentration yields the fastest and most frequent MT catastrophes, but we can say that when tubulin concentrations are greater than or equal to 10 uM, there are more occurrences of MT catastrophe. This could mean that there is a threshold tubulin concentration above which microtubules can utilize free tubulin for polymerization.
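Tabulating the MLEs printed above makes the trends in $\alpha$ and $\beta$ versus concentration easier to inspect. The values below are copied (rounded) from the output of the previous cell.

```python
import pandas as pd

# MLEs for each tubulin concentration, copied from the printed output above
mle_df = pd.DataFrame(
    {
        "alpha": [2.444, 2.680, 3.211, 2.915, 3.361],
        "beta (1/s)": [0.00755, 0.00878, 0.00903, 0.00766, 0.00718],
    },
    index=pd.Index([7, 9, 10, 12, 14], name="concentration (uM)"),
)
print(mle_df)
```

From the table, $\alpha$ rises from 7 to 14 uM (with the dip at 12 uM), while $\beta$ peaks at 10 uM.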

In [10]:
%load_ext watermark

%watermark -v -p numpy,pandas,jupyterlab
CPython 3.8.5
IPython 7.18.1

numpy 1.19.1
pandas 1.1.1
jupyterlab 2.2.6